Refactor unicode emoji parsing #20

freya022 · 2023-11-12T15:42:35Z

Depends on #19

Improves performance for getByAlias and sibling methods

# Conflicts: # lib/src/main/java/net/fellbaum/jemoji/EmojiManager.java

# Conflicts: # lib/src/jmh/java/benchmark/EmojiManagerAliasBenchmark.java # lib/src/main/java/net/fellbaum/jemoji/EmojiManager.java

felldo · 2023-11-23T10:09:22Z

I also had the idea of doing it the way you did, but decided against it because of performance and future extensibility concerns.
Regarding extensibility:
This refactor isn't really suited to the planned index method for emojis. You return the codepoint index, which most people probably don't want. I had a prototype that returned both the codepoint index and the char index; that wouldn't really be possible with this refactor, as there is a small performance bottleneck and it would apply to every method. To get the string index I had to use a StringBuilder to keep track of the already processed characters and then take its length when an emoji is detected. This not only consumes a bit more memory but is also slower, as it's another step.
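
To illustrate the prototype idea described above, here is a minimal sketch, not the actual prototype code. `IndexSketch`, `Position` and `isMatch` are hypothetical names; `isMatch` is only a stand-in for the library's real emoji detection and simply treats every supplementary code point as a match.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexSketch {
    /** Hypothetical holder for both positions of a detected match. */
    static final class Position {
        final int codePointIndex;
        final int charIndex;

        Position(int codePointIndex, int charIndex) {
            this.codePointIndex = codePointIndex;
            this.charIndex = charIndex;
        }
    }

    // Assumption: stand-in for real emoji detection, matches any supplementary code point.
    static boolean isMatch(int codePoint) {
        return Character.isSupplementaryCodePoint(codePoint);
    }

    static List<Position> findPositions(String text) {
        int[] codePoints = text.codePoints().toArray();
        // Holds everything processed so far; its length is the char index of the next code point.
        StringBuilder processed = new StringBuilder();
        List<Position> result = new ArrayList<>();
        for (int i = 0; i < codePoints.length; i++) {
            if (isMatch(codePoints[i])) {
                result.add(new Position(i, processed.length()));
            }
            // The extra step: every code point is copied, whether or not it matches.
            processed.appendCodePoint(codePoints[i]);
        }
        return result;
    }
}
```

For the input `"a\uD83D\uDE00b\uD83D\uDE00"` the two matches sit at codepoint indices 1 and 3 but at char indices 1 and 4, since each match occupies two chars. The extra buffer and the per-codepoint append are the memory and time costs mentioned above.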

When it comes to performance, the refactor also seems to be worse for some methods. These were my benchmark results, with the parameters changed to 10 iterations and 3 forks in an attempt to get more accurate results:

//Current code base
Benchmark                                                             Mode  Cnt   Score   Error  Units
EmojiManagerBenchmark.extractEmojisInOrder                            avgt   30   2,157 ± 0,049  ms/op
EmojiManagerBenchmark.extractEmojisInOrderOnlyEmojisLengthDescending  avgt   30   9,752 ± 0,028  ms/op
EmojiManagerBenchmark.extractEmojisInOrderOnlyEmojisRandomOrder       avgt   30  10,300 ± 0,046  ms/op
EmojiManagerBenchmark.removeAllEmojis                                 avgt   30   2,915 ± 0,044  ms/op
EmojiManagerBenchmark.replaceAllEmojis                                avgt   30   2,949 ± 0,121  ms/op
EmojiManagerBenchmark.replaceAllEmojisFunction                        avgt   30   2,849 ± 0,008  ms/op
//Refactor
Benchmark                                                             Mode  Cnt   Score   Error  Units
EmojiManagerBenchmark.extractEmojisInOrder                            avgt   30   4,209 ± 0,980  ms/op
EmojiManagerBenchmark.extractEmojisInOrderOnlyEmojisLengthDescending  avgt   30   9,807 ± 0,045  ms/op
EmojiManagerBenchmark.extractEmojisInOrderOnlyEmojisRandomOrder       avgt   30  10,355 ± 0,044  ms/op
EmojiManagerBenchmark.removeAllEmojis                                 avgt   30   2,962 ± 0,041  ms/op
EmojiManagerBenchmark.replaceAllEmojis                                avgt   30   3,076 ± 0,034  ms/op
EmojiManagerBenchmark.replaceAllEmojisFunction                        avgt   30   3,238 ± 0,106  ms/op

So currently I would tend to not merge this PR

freya022 · 2023-11-24T09:49:36Z

After discussion, the above benchmark turned out to be erroneous (note the huge margin of error in extractEmojisInOrder).

My runs were as follows:
Master:

Refactor:

PR #22 will require getting the char index of the emoji, which isn't necessary for most operations, so this PR will be closed.

While String#offsetByCodePoints could be used once an emoji is found, it has to traverse most of the string up to the specified offset; manually keeping track of the character index ourselves is likely cheaper in these cases.
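
A rough sketch of the two options, under stated assumptions: `CharIndexSketch` and `isMatch` are hypothetical and not part of the library, and `isMatch` merely treats supplementary code points as matches. Only `String#offsetByCodePoints`, `String#codePointAt` and `Character#charCount` are real JDK calls.

```java
import java.util.ArrayList;
import java.util.List;

final class CharIndexSketch {
    // Assumption: stand-in for real emoji detection, matches any supplementary code point.
    static boolean isMatch(int codePoint) {
        return Character.isSupplementaryCodePoint(codePoint);
    }

    // Option 1: work on codepoint indices and convert each match afterwards.
    // Every conversion walks the string again from index 0 up to the match.
    static List<Integer> viaOffsetByCodePoints(String text) {
        int[] codePoints = text.codePoints().toArray();
        List<Integer> charIndices = new ArrayList<>();
        for (int i = 0; i < codePoints.length; i++) {
            if (isMatch(codePoints[i])) {
                charIndices.add(text.offsetByCodePoints(0, i));
            }
        }
        return charIndices;
    }

    // Option 2: advance a char index alongside the scan, so no second walk is needed.
    static List<Integer> viaManualTracking(String text) {
        List<Integer> charIndices = new ArrayList<>();
        int charIndex = 0;
        while (charIndex < text.length()) {
            int codePoint = text.codePointAt(charIndex);
            if (isMatch(codePoint)) {
                charIndices.add(charIndex);
            }
            // 1 for BMP code points, 2 for surrogate pairs.
            charIndex += Character.charCount(codePoint);
        }
        return charIndices;
    }
}
```

Both methods return the same char indices; the difference is that the first repeats part of the traversal for every match, while the second does all the work in a single pass.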

freya022 added 8 commits November 9, 2023 11:48

Use maps of emojis by their aliases, by their alias group

715d344

Improves performance for getByAlias and sibling methods

Add getByAlias benchmark

15ab692

Improve EMOJI_CODEPOINT_COMPARATOR

d4c1ad1

Merge remote-tracking branch 'felldo/master' into refactor/performance

086b209

# Conflicts: # lib/src/main/java/net/fellbaum/jemoji/EmojiManager.java

Arrange benchmarks to avoid using unnecessary parameters

5be9d3a

Refactor unicode emoji parsing

7d6c131

Refactor EmojiManager#replaceEmojis

13d9ae4

Replace other functions

800fd03

freya022 mentioned this pull request Nov 12, 2023

Optimize getting code points of strings #21

Merged

freya022 added 2 commits November 16, 2023 14:22

Merge branch 'master' into refactor/unicode-emoji-parsing

4e103ee

# Conflicts: # lib/src/jmh/java/benchmark/EmojiManagerAliasBenchmark.java # lib/src/main/java/net/fellbaum/jemoji/EmojiManager.java

Remove EmojiManager#getCodePointCount

7ec8d8b

freya022 marked this pull request as ready for review November 16, 2023 13:24

freya022 closed this Nov 24, 2023
